Poster: timlee Date: Mar 11, 2016 9:44pm
Forum: texts Subject: Re: How to make sure a scanned pdf file will be OCRed?

Thanks. I would like a general solution, but here is an example:

https://archive.org/details/timlee126_yahoo_Tmp1

Poster: Jeff Kaplan Date: Mar 12, 2016 10:44am
Forum: texts Subject: Re: How to make sure a scanned pdf file will be OCRed?

that one failed because one of our systems has been on life support for about a week and should be fixed by next week... i hope. i'm re-running it now in the hope that it succeeds.

in general here are a few typical reasons why ocr might not happen or succeed:
1. the language is not ocrable
2. the text is hard to decipher due to the font style or because it is handwritten
3. the condition of the scans is poor
4. the book is too long. we can only handle up to 9,999 pages
5. the file is corrupt in some way
6. there is an issue with the resolution of the files

other than that it is likely a failure on our part. some things fail because they have unusual issues or, as with this current one, because a part of the system is down.
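
Of the reasons listed above, only the page limit comes with a concrete number, so it is the one that can be checked mechanically before uploading. A minimal sketch, assuming the page count has already been obtained by some other means (the function and constant names are made up for illustration):

```python
# Limit quoted in reason 4 above; the names here are hypothetical.
MAX_OCR_PAGES = 9999


def within_page_limit(page_count, limit=MAX_OCR_PAGES):
    """Return True if a book with this many pages fits under the quoted limit."""
    return 0 < page_count <= limit


print(within_page_limit(350))
print(within_page_limit(12000))
```

The other reasons (language, legibility, scan condition, corruption, resolution) have no thresholds stated in this thread, so the sketch leaves them out.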

Poster: timlee Date: Mar 12, 2016 3:05pm
Forum: texts Subject: Re: How to make sure a scanned pdf file will be OCRed?

Thanks.

In similar cases, can I rerun it, and how?

Poster: Jeff Kaplan Date: Mar 12, 2016 3:41pm
Forum: texts Subject: Re: How to make sure a scanned pdf file will be OCRed?

better to let us know and have us rerun it. there are many reasons they can fail. i just listed the more common ones.

Poster: Jeff Kaplan Date: Mar 12, 2016 4:06pm
Forum: texts Subject: Re: How to make sure a scanned pdf file will be OCRed?

ok, your book is done now.

Poster: timlee Date: Mar 12, 2016 9:16pm
Forum: texts Subject: Re: How to make sure a scanned pdf file will be OCRed?

Thanks. But the OCRed text has not been put back into a pdf file. A pdf file with OCR text is usually created, but not in this case.

Poster: Jeff Kaplan Date: Mar 13, 2016 9:22am
Forum: texts Subject: Re: How to make sure a scanned pdf file will be OCRed?

the system never alters the uploaded file and since it was detected as a text.pdf it would not create another one.

Poster: timlee Date: Mar 13, 2016 10:03am
Forum: texts Subject: Re: How to make sure a scanned pdf file will be OCRed?

I meant that a new pdf file is created with the original scanned images and the OCR text, without altering the original pdf file.

Is the reason that the pdf file was detected as a text pdf instead of an image container pdf?
As I remember, there was no option for this when I uploaded the file.

The file was still OCRed, but the output is a text file that was not put back into a new pdf file.
Now I have changed the metadata option from text pdf to image container pdf, but no new pdf file combining the original scans and the OCR text has been created. How can I make it create such a new pdf file?


The same problem happened with another file: https://archive.org/details/timlee126_yahoo_All_201603

Thanks.

This post was modified by timlee on 2016-03-13 17:03:04

Poster: Jeff Kaplan Date: Mar 13, 2016 12:17pm
Forum: texts Subject: Re: How to make sure a scanned pdf file will be OCRed?

yes, it needs to be an image container to produce a text pdf. i reran it and now there is one at https://archive.org/download/timlee126_yahoo_All_201603/all_text.pdf
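
A minimal sketch of how one might check whether an item already has such a derived pdf, rather than waiting and asking. It assumes the `_text.pdf` suffix seen in the link above is how the derived file is named, and that the item's file list is read from the public `https://archive.org/metadata/<identifier>` endpoint. The function names and the sample file list are made up for illustration.

```python
import json
from urllib.request import urlopen


def fetch_item_files(identifier):
    """Fetch an item's file list from the metadata endpoint (needs network access)."""
    with urlopen(f"https://archive.org/metadata/{identifier}") as resp:
        return json.load(resp).get("files", [])


def find_text_pdfs(files):
    """Return the names of files that look like derived text pdfs."""
    return [
        f["name"]
        for f in files
        if f.get("name", "").lower().endswith("_text.pdf")
    ]


# Offline example with a hand-written file list (not real item data).
sample_files = [
    {"name": "all.pdf", "source": "original"},
    {"name": "all_text.pdf", "source": "derivative"},
    {"name": "all_djvu.txt", "source": "derivative"},
]
print(find_text_pdfs(sample_files))
```

For a live check, `find_text_pdfs(fetch_item_files("timlee126_yahoo_All_201603"))` would apply the same filter to the real item. An empty result would suggest that no text pdf has been derived yet.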

Poster: timlee Date: Mar 13, 2016 3:09pm
Forum: texts Subject: Re: How to make sure a scanned pdf file will be OCRed?

Thanks.

The first file still has no new pdf file created with OCR text. https://archive.org/details/timlee126_yahoo_Tmp1
Can you also rerun it?

This has happened quite often recently, and I don't want to bother you each time.


Is it possible for me to mark an original pdf as an image container pdf, so that it gets OCRed? (Both when I upload the file and afterwards, when I find that no pdf file with OCR text has been created.)

This post was modified by timlee on 2016-03-13 21:58:06

This post was modified by timlee on 2016-03-13 22:09:07